The least squares and maximum likelihood paradigms differ in how the objective function for parameter estimation is formulated. In the least squares paradigm, the objective function is simply the
sum of squared residuals. The boundary restriction on the dependent variable can be treated as a set of linear
constraints, so that parameter estimation becomes a quadratic programming (QP) problem
(Vanderbei, 2008):
\begin{align*}
\text{Minimize} \qquad & f\left( \beta \right)=\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{{{\left[ \left(
{{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \right]}^{2}}}} \\
\text{Subject to} \qquad
&\qquad \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le b \\
&\quad -\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta \le -a,
\end{align*}
where $n$ is the number of spatial units, ${{T}_{i}}$ is the number of temporal observations for unit $i$,
and the parameter space is $\beta \in {{\Omega }_{\beta }}$. We have not yet specified what ${{\Omega }_{\beta }}$ should be;
doing so is critical to solving the above QP problem successfully. We discuss this issue in section 3.
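As a rough numerical illustration of this QP, the following sketch demeans a small synthetic panel and minimizes the sum of squared residuals subject to the linear constraints $a \le (x_{it}-\bar{x}_i)\beta \le b$; all data, dimensions, and bounds here are illustrative assumptions, not values from the text.

```python
# Minimal sketch of the constrained least-squares (QP) problem with SciPy.
# The synthetic data and the bounds a, b are assumptions for illustration.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

rng = np.random.default_rng(0)
n, T, k = 10, 5, 2               # spatial units, time periods, covariates
a, b = -1.0, 1.0                 # bounds on the dependent variable

x = rng.normal(size=(n, T, k))
beta_true = np.array([0.2, -0.1])
y = np.clip(x @ beta_true + 0.05 * rng.normal(size=(n, T)), a, b)

# within (time-demeaning) transformation for each unit
xd = (x - x.mean(axis=1, keepdims=True)).reshape(-1, k)
yd = (y - y.mean(axis=1, keepdims=True)).reshape(-1)

objective = lambda beta: np.sum((yd - xd @ beta) ** 2)
constraint = LinearConstraint(xd, a, b)   # a <= xd @ beta <= b, row by row

res = minimize(objective, np.zeros(k), method="trust-constr",
               constraints=[constraint])
```

When the constraints are slack at the optimum, as in this simulation, the solution coincides with the ordinary within-groups least squares estimate.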
We can modify the objective function slightly by imposing a distributional assumption
on the demeaned dependent variable $\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)$ in
(2.1):
\begin{equation}
\left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)\sim TN\left[ \left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta ,
{{\sigma }^{2}};p_{1},q_{1} \right].
\tag{2.2}
\end{equation}
Here the choice of the lower and upper limits, $p_{1}$ and $q_{1}$, is vital to a successful maximum likelihood
estimation. In principle, $p_{1}$ and $q_{1}$ should be set to the least and greatest values consistent with the available data:
\begin{align*}
& {{p}_{1}}=a-\left( \bar{\bar{y}}+t_{{{\sigma }_{b}}}^{\min }\cdot {{\sigma }_{b}} \right)\\
& {{q}_{1}}=b-\left( \bar{\bar{y}}+t_{{{\sigma }_{b}}}^{\max }\cdot {{\sigma }_{b}} \right).
\end{align*}
where ${{\sigma }_{b}}$ is the between-groups deviation estimated under the untruncated normal assumption,
and $t_{{{\sigma }_{b}}}^{\min }$ and $t_{{{\sigma }_{b}}}^{\max }$ refer to the largest standardized deviations of ${{\bar{y}}_{i}}$
in the negative and positive directions, respectively.
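Under one plausible reading of these formulas, $t_{\sigma_b}^{\min}$ and $t_{\sigma_b}^{\max}$ are the extreme standardized deviations of the group means $\bar{y}_i$ from the grand mean; the sketch below computes $p_1$ and $q_1$ from simulated data under that assumption (the data and this interpretation are illustrative, not taken from the text).

```python
# Hedged sketch: truncation limits p1, q1 computed from the group means,
# assuming t_min and t_max are the extreme standardized deviations.
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 1.0
y = rng.uniform(a, b, size=(10, 5))       # illustrative bounded panel

ybar_i = y.mean(axis=1)                   # time means  \bar{y}_i
grand = ybar_i.mean()                     # grand mean  \bar{\bar{y}}
sigma_b = ybar_i.std(ddof=1)              # between-groups deviation
t_min = (ybar_i - grand).min() / sigma_b  # largest negative deviation
t_max = (ybar_i - grand).max() / sigma_b  # largest positive deviation

p1 = a - (grand + t_min * sigma_b)        # reduces to a - min_i ybar_i
q1 = b - (grand + t_max * sigma_b)        # reduces to b - max_i ybar_i
```

Note that under this reading the formulas collapse to $p_1 = a - \min_i \bar{y}_i$ and $q_1 = b - \max_i \bar{y}_i$, i.e., limits anchored at the most extreme observed group means.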
Given this distributional assumption, the objective function can be specified as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}+\frac{1}
{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\bar{y}}}_{i}} \right)-\left( {{x}_{it}}-{{{\bar{x}}}_{i}}
\right)\beta \right]}^{2}} \right\}}},
\end{equation*}
where ${{D}_{it}}=\log \left\{ \sqrt{2\pi }\sigma \left[ \Phi \left( \frac{q_{1}-\left( {{x}_{it}}-{{{\bar{x}}}_{i}}
\right)\beta }{\sigma } \right)-\Phi \left( \frac{p_{1}-\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)\beta }
{\sigma } \right) \right] \right\}$.
Note that the likelihood function above introduces one additional parameter, $\sigma$, whose
parameter space $\sigma \in {{\Omega }_{\sigma }}$ also needs to be specified. The inequality constraints and the
parameter space $\beta \in {{\Omega }_{\beta }}$ remain the same.
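A likelihood of this form can be maximized numerically. The sketch below uses `scipy.stats.truncnorm` (which takes standardized truncation limits) as the truncated-normal log-density and fits $\beta$ and $\sigma$ on simulated demeaned data; the limits, sample size, and parameter values are illustrative assumptions.

```python
# Hedged sketch: maximum likelihood for a truncated-normal regression on
# demeaned data, with assumed truncation limits p1, q1.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import truncnorm

rng = np.random.default_rng(2)
p1, q1 = -0.5, 0.5                       # truncation limits (assumed)
beta_true, sigma_true = 0.3, 0.1
xd = rng.normal(scale=0.3, size=200)     # demeaned scalar covariate

# simulate demeaned outcomes from the assumed truncated normal
lo = (p1 - xd * beta_true) / sigma_true
hi = (q1 - xd * beta_true) / sigma_true
yd = truncnorm.rvs(lo, hi, loc=xd * beta_true, scale=sigma_true,
                   random_state=rng)

def negloglik(theta):
    beta, log_sigma = theta
    sigma = np.exp(log_sigma)            # reparameterize so sigma > 0
    mu = xd * beta
    return -truncnorm.logpdf(yd, (p1 - mu) / sigma, (q1 - mu) / sigma,
                             loc=mu, scale=sigma).sum()

res = minimize(negloglik, x0=[0.0, np.log(0.2)], method="Nelder-Mead")
beta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

Reparameterizing $\sigma$ on the log scale is one simple way to respect the constraint $\sigma > 0$ without an explicitly bounded parameter space.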
From the perspective of the likelihood paradigm, however, both objective functions above are
problematic, because the panel regression is misspecified in the first place.$^{9}$
To see why this is so, we first assume that the dependent variable ${{y}_{it}}$ is distributed as truncated normal
\begin{equation*}
{{y}_{it}}\sim TN\left( {{\mu }_{i}},{{\sigma }^{2}};a,b \right),
\end{equation*}
where ${\mu_{i}}$ is the district-level location parameter.
The time mean of ${{y}_{it}}$ is
\begin{equation*}
{{E}_{t}}\left( {{y}_{it}} \right)={{\mu }_{i}}-\frac{\sigma \left\{ \exp \left[ -\frac{{{\left( b-{{\mu }_{i}}
\right)}^{2}}}{2{{\sigma }^{2}}} \right]-\exp \left[ -\frac{{{\left( a-{{\mu }_{i}} \right)}^{2}}}{2{{\sigma }^{2}}}
\right] \right\}}{\sqrt{2\pi }\left[ \Phi \left( \frac{b-{{\mu }_{i}}}{\sigma } \right)-\Phi \left( \frac{a-{{\mu }_{i}}}
{\sigma } \right) \right]}.
\end{equation*}
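This closed-form mean can be checked numerically against `scipy.stats.truncnorm`, which works with standardized truncation limits; the parameter values below are illustrative.

```python
# Verify the closed-form time mean of a truncated normal against SciPy.
import numpy as np
from scipy.stats import norm, truncnorm

a, b = 0.0, 1.0
mu_i, sigma = 0.7, 0.3                   # illustrative parameters

# closed-form time mean from the expression above
num = (np.exp(-(b - mu_i) ** 2 / (2 * sigma ** 2))
       - np.exp(-(a - mu_i) ** 2 / (2 * sigma ** 2)))
den = np.sqrt(2 * np.pi) * (norm.cdf((b - mu_i) / sigma)
                            - norm.cdf((a - mu_i) / sigma))
Et_y = mu_i - sigma * num / den

# reference value from SciPy's truncated normal
ref = truncnorm.mean((a - mu_i) / sigma, (b - mu_i) / sigma,
                     loc=mu_i, scale=sigma)
```

Because $\mu_i$ here sits closer to the upper bound ($b-\mu_i < \mu_i-a$), the time mean falls below $\mu_i$, a concrete instance of the bias discussed next.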
Unless $b-{{\mu }_{i}}={{\mu }_{i}}-a$ or $\left( a,b \right)\to \left( -\infty ,\infty \right)$, the time
mean ${{E}_{t}}\left( {{y}_{it}} \right)$ (also denoted ${{\bar{y}}_{i}}$) is a biased estimator of ${{\mu }_{i}}$
(Johnson and Kotz, 1970: 81).$^{10}$ To purge the between-groups variation, we should subtract
${\mu_{i}}$ from ${{y}_{it}}$; using the time mean ${{E}_{t}}\left( {{y}_{it}} \right)$ as an estimate of ${\mu_{i}}$
is therefore a mistake, and the differencing operation ${{y}_{it}}-{{E}_{t}}\left( {{y}_{it}} \right)$ fails to
generate valid within-groups variation. This is the common problem of the panel regressions in (2.1) and
(2.2).
If we want to specify a model conceptually equivalent to the panel regression, we can use the
individual-level dependent variable to estimate the district-level location parameters ${{\mu }_{i}}$, and then perform
the demeaning operation to derive the within-groups regression.$^{11}$ In this scenario,
the dependent variable can be specified as
\begin{equation*}
\left( {{y}_{it}}-{\hat{\mu }_{i}} \right)\sim TN\left( {{x}^{*}_{it}}\beta ,{{\sigma }^{2}};p_{2},q_{2} \right),
\end{equation*}
where ${{x}^{*}_{it}}$ denotes the covariate matrix after demeaning, and
\begin{align*}
{{p}_{2}}&=a-\left( \hat{\mu }+t_{{{\sigma }_{{{{\hat{\mu }}}_{i}}}}}^{\min }\cdot
{{\sigma }_{{{{\hat{\mu }}}_{i}}}} \right)\\
{{q}_{2}}&=b-\left( \hat{\mu }+t_{{{\sigma }_{{{{\hat{\mu }}}_{i}}}}}^{\max }\cdot {{\sigma }_{{{{\hat{\mu }}}_{i}}}}
\right).
\end{align*}
Applying maximum likelihood estimation, we can derive the objective function as
\begin{equation*}
\text{Maximize} \qquad \log L\equiv -\sum\limits_{i=1}^{n}{\sum\limits_{t=1}^{{{T}_{i}}}{\left\{ {{D}_{it}}+\frac{1}
{2{{\sigma }^{2}}}{{\left[ \left( {{y}_{it}}-{{{\hat{\mu }}}_{i}} \right)-x_{it}^{*}\beta \right]}^{2}} \right\}}},
\end{equation*}
where ${{D}_{it}}=\log \left\{ \sqrt{2\pi }\sigma \left[ \Phi \left( \frac{q_{2}-x_{it}^{*}\beta }{\sigma } \right)-\Phi
\left( \frac{p_{2}-x_{it}^{*}\beta }{\sigma } \right) \right] \right\}$. Since the demeaning operation uses
the maximum likelihood estimates ${{\hat{\mu }}_{i}}$, no demeaning of the covariate matrix is strictly necessary.
However, we apply the same demeaned specification for the sake of comparability. In section 3, we
apply constrained optimization techniques to solve the three optimization problems under different parameter constraints.
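As a rough illustration of the two-stage idea (estimate each $\mu_i$ by truncated-normal maximum likelihood without covariates, then subtract it to form the within-groups deviation), consider the sketch below; treating $\sigma$ as known is an assumption made only to keep the example short, and all data are simulated.

```python
# Hedged two-stage sketch: per-unit truncated-normal MLE of mu_i, then
# subtraction to obtain within-groups deviations. Illustrative only;
# sigma is assumed known here, unlike in the text.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import truncnorm

rng = np.random.default_rng(3)
a, b, sigma = 0.0, 1.0, 0.2
mu_true = np.array([0.3, 0.5, 0.7])       # district-level locations
T = 200

def draw(mu):
    return truncnorm.rvs((a - mu) / sigma, (b - mu) / sigma,
                         loc=mu, scale=sigma, size=T, random_state=rng)

y = np.stack([draw(m) for m in mu_true])  # panel, shape (3, T)

def mle_mu(yi):
    # stage 1: per-unit truncated-normal MLE of the location parameter
    nll = lambda mu: -truncnorm.logpdf(yi, (a - mu) / sigma, (b - mu) / sigma,
                                       loc=mu, scale=sigma).sum()
    return minimize_scalar(nll, bounds=(a, b), method="bounded").x

mu_hat = np.array([mle_mu(yi) for yi in y])
within = y - mu_hat[:, None]              # stage 2: within-groups deviation
```

Unlike the raw time means, which are pulled toward the interior of $[a,b]$ when truncation is asymmetric, the stage-1 MLE targets the location parameter $\mu_i$ directly.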
9 In contemporary statistical science, the likelihood theory is a crucial paradigm of inference for data analysis (Royall, 1997: xiii). It provides a unifying approach of statistical modeling to both frequentists and Bayesians with the criterion of maximum likelihood (Azzalini, 1996). The rapid development of political methodology in the last two decades has also witnessed the establishment of the likelihood paradigm in the scientific study of politics (King, 1998). As a model of inference, the fundamental assumption of the likelihood theory is the likelihood principle, which states that "all evidence, which is obtained from an experiment, about an unknown quantity $\theta$, is contained in the likelihood function of $\theta$ for the given function." (Berger and Wolpert, 1984: vii) In other words, given the fact that the likelihood function is defined by the probability density (or mass) function, we must make a distributional assumption of the dependent variable to derive a likelihood function. The plausibility of such a distributional assumption is therefore vital to the validity of the statistical inference.
10 When $b-{{\mu }_{i}}={{\mu }_{i}}-a$, the normal distribution is evenly truncated at both ends. When $\left( a,b \right)\to \left( -\infty ,\infty \right)$, the variable is not truncated at all. Both situations rarely occur when the dependent variable is distributed as truncated normal.
11 This involves a two-stage procedure. In the first stage, ${\hat{\mu }_{i}}$ is estimated from ${{y}_{it}}$ without covariates. In the second stage, we take ${\hat{\mu }_{i}}$ as a district-level property and subtract it to derive the complete within-groups deviation.